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Abstract 

Standard approaches to analysing data in genome-wide association studies (GWAS) 
ignore any potential functional relationships between genetic markers. In contrast gene 
pathways analysis uses prior information on functional structure within the genome to 
identify pathways associated with a trait of interest. In a second step, important single 
nucleotide polymorphisms (SNPs) or genes may be identified within associated pathways. 
The pathways approach is motivated by the fact that many genes do not act alone, but in- 
stead have effects that are likely to be mediated through their interaction in gene pathways. 
Where this is the case, pathways approaches may reveal aspects of a trait's genetic archi- 
tecture that would otherwise be missed when considering SNPs in isolation. Most path- 
ways methods begin by testing SNPs one at a time, and so fail to capitalise on the potential 
advantages inherent in a multi-SNP, joint modelling approach. Here we describe a dual- 
level, sparse regression model for the simultaneous identification of pathways, genes and 
SNPs associated with a quantitative trait. Our method takes account of various factors 
specific to the joint modelling of pathways with genome-wide data, including widespread 
correlation between genetic predictors, and the fact that variants may overlap multiple 
pathways. We use a resampling strategy that exploits finite sample variability to provide 
robust rankings for pathways, SNPs and genes. We test our method through simulation, 
and use it to perform pathways- driven SNP selection in a search for pathways, genes and 
SNPs associated with variation in serum high-density lipoprotein cholesterol (HDLC) lev- 
els in two separate GWAS cohorts of Asian adults. By comparing results from both cohorts 
we identify a number of candidate pathways including those associated with cardiomyopa- 
thy, and T cell receptor and PPAR signalling. Highlighted genes include those associated 
with the L-type calcium channel, adenylate cyclase, integrin, laminin, MAPK signalling 
and immune function. Software implementing the methods described here, together with 
sample data is available at http:// www 2. imperial, ac.uk/ ^gmontana/ psrrr.htm. 



Introduction 



Much attention continues to be focused on the problem of identifying SNPs and genes influ- 
encing a quantitative or dichotomous trait in genome wide scans [53]. Despite this, in many 
instances gene variants identified in GWAS have so far uncovered only a relatively small part 
of the known heritability of most common diseases [84]. Possible explanations include the 
presence of multiple SNPs with small effects, or of rare variants, which may be hard to detect 
using conventional approaches [84, 52, 28]. 

One potentially powerful approach to uncovering the genetic etiology of disease is moti- 
vated by the observation that in many cases disease states are likely to be driven by multiple 
genetic variants of small to moderate effect, mediated through their interaction in molecular 
networks or pathways, rather than by the effects of a few, highly penetrant mutations [63]. 
Where this assumption holds, the hope is that by considering the joint effects of variants 
acting in concert, pathways GWAS methods will reveal aspects of a disease's genetic architec- 
ture that would otherwise be missed when considering variants individually [86, 25]. In this 
section we describe a sparse regression method utilising prior information on gene pathways 
to identify putative causal pathways, along with the constituent SNPs and genes that may 
be driving pathways association. 

Sparse modelling approaches are becoming increasingly popular for the analysis of genome 
wide datasets [65, 17, 3, 88]. Sparse regression models enable the joint modelling of large 
numbers of SNP predictors, and perform 'model selection' by highlighting small numbers of 
variants influencing the trait of interest. These models work by penalising or constraining the 
size of estimated regression coefficients. An interesting feature of these methods is that dif- 
ferent sparsity patterns, that is different sets of genetic predictors having specified properties, 
can be obtained by varying the nature of this constraint. For example, the lasso [78] selects a 
subset of variants whose main effects best predict the response. Where predictors are highly 
correlated, the lasso tends to select one of a group of correlated predictors at random. In 
contrast, the elastic net [92] selects groups of correlated variables. Model selection may also 
be driven by external information, unrelated to any statistical properties of the data being 
analysed. For example, the fused lasso [80, 79] uses ordering information, such as the position 
of genomic features along a chromosome to select 'adjacent' features together. 

Prior information on functional relationships between genetic predictors can also been used 
to drive the selection of groups of variables. In the present context, information mapping genes 
and SNPs to functional gene pathways has recently been used in sparse regression models for 
pathway selection. Chen et al. [15] describe a method that uses a combination of lasso 
and ridge regression to assess the significance of association between a candidate pathway 
and a dichotomous (case-control) phenotype, and apply this method in a study of colon 
cancer etiology. In contrast. Silver et al. [67] use group lasso penalised regression to select 
pathways associated with a multivariate, quantitative brain-imaging phenotype characteristic 
of structural change in the brains of patients with Alzheimer's disease. 

In identifying pathways associated with a trait of interest, a natural follow-up question is to 
ask which SNPs and/or genes are driving pathway selection? We might further ask a related 
question: can the use of prior information on putative gene interactions within pathways 
increase power to identify causal SNPs, compared to alternative methods that disregard such 
information? One way to answer these questions is by conducting a two-stage analysis, in 
which we first identify important pathways, and then in a second step search for SNPs within 
selected pathways [20, 21]. There are however a number of problems with this approach. 
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Firstly, highlighted SNPs are then not necessarily those that were driving pathway selection 
in the first step of the analysis. Secondly, the implicit (and reasonable) assumption is that 
only a small number of SNPs in a pathway are driving pathway selection, so that ideally we 
would prefer a model that has this assumption built in. The above considerations point to 
the use of a 'dual-level' sparse regression model that imposes sparsity at both the pathway 
and SNP level. Such a model would perform simultaneous pathway and SNP selection, with 
the additional benefit of being simpler to implement. 

A suitable sparse regression model enforcing the required dual-level sparsity is the sparse 
group lasso (SGL) [69]. SGL is a comparatively recent development in sparse modelling, and 
in simulations has been shown to accurately recover dual-level sparsity, in comparison to both 
the group lasso and lasso [26, 69]. SGL has been used for the identification of rare variants 
in a case-control study by grouping SNPs into genes [91]; for the identification of genomic 
regions whose copy number variations have an impact on RNA expression levels [59]; and 
to model geographical factors driving climate change [14]. SGL can be seen as fitting into a 
wider class of structured-sparsity inducing models that use prior information on relationships 
between predictors to enforce different sparsity patterns [90, 38, 41]. 

In the next section (Methods) we outline our method for sparse, pathways-driven SNP 
selection, and demonstrate through simulation that the incorporation of prior information 
mapping SNPs to gene pathways can boost the power to detect SNPs associated with a 
quantitative trait. In the following section (Results), we describe an application study, in 
which we investigate pathways, SNPs and genes associated with serum high-density lipopro- 
tein cholesterol (HDLC) levels in two separate cohorts of Asian adults. HDLC refers to 
the cholesterol carried by small lipoprotein molecules, so called high density lipoproteins 
(HDLs). HDLs help remove the cholesterol aggregating in arteries, and are therefore pro- 
tective against cardiovascular diseases [81]. Serum HDLC levels are genetically heritable 
(/i^ = 0.485) [57]. GWAS studies have now uncovered more than 100 HDLC associated loci 
(see www.genome.gov/gwastudies, Hindorff et al. [32]). However, considering serum lipids as 
a whole, variants so far identified account for only 25-30% of the genetic variance, highlighting 
the limited power of current methodologies to detect hidden genetic factors [76] . 

Methods 

This section is organised as follows. We begin by introducing the sparse group lasso (SGL) 
model for pathways-driven SNP selection, along with an efficient estimation algorithm, for the 
case of non-overlapping pathways. We then describe a simulation study illustrating superior 
group (pathway) and variant (SNP) selection performance in the case that the true supporting 
model is group-sparse. We continue by extending the previous model to the case of overlapping 
pathways. In principle, we can then solve this model using the estimation algorithm described 
for the non-overlapping case. However, we argue that this approach does not give us the 
outcome we require. For this reason we describe a modified estimation algorithm that assumes 
pathway independence, and demonstrate in a simulation study that this new algorithm is 
able to identify the correct SNPs and pathways with improved sensitivity and specificity. We 
complete this section with a description of a method to reduce bias in SNP and pathway 
selection, together with a subsampling procedure to rank SNPs and pathways in order of 
importance. 
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The sparse group lasso model 



We arrange the observed values for a univariate quantitative trait or phenotype, measured for 
N unrelated individuals, in an {N x 1) response vector y. We assume minor allele counts for 
P SNPs are recorded for all individuals, and denote by Xij the minor allele count for SNP j 
on individual i. These are arranged in an (N x P) genotype design matrix X. Phenotype and 
genotype vectors are mean centred, and SNP genotypes are standardised to unit variance, so 
thatj:ixl = l,iovj = l,...,P. 

We assume that all P SNPs may be mapped to L groups or pathways, Gi C {1, . . . , P}, / = 
1, . . . , L, and begin by considering the case where pathways are disjoint or non-overlapping, 
so that Gi ^ Qv =0 for any I ^ I' . We denote the vector of SNP regression coefficients 
hy (3 — . . . , /3p), and additionally denote the matrix containing all SNPs mapped to 
pathway Qi by X/ = (x/^, • . . , xpj, where Xj = {xij^X2j^ • . . , x^j)' ^ is the column vector 
of observed SNP minor allele counts for SNP j, and Pi is the number of SNPs in Qi. We 
denote the corresponding vector of SNP coefficients by /3/ = /3/2, • • • , Ppi)- 

In general, where P is large, we expect only a small proportion of SNPs to be 'causal', 
in the sense that they exhibit phenotypic effects. A key assumption in pathways analysis is 
that these causal SNPs will tend to be enriched within a small set, C C {1, . . . , L}, of causal 
pathways, with \C\ ^ L, where \C\ denotes the size (cardinality) of C. We denote the set of 
causal SNPs mapping to pathway Qi by 5/, and make the further assumption that most SNPs 
in a causal pathway are non-causal, so that < P/, where denotes the size (cardinality) 
of 5/. A suitable sparse regression model imposing the required, dual-level sparsity pattern 
is the sparse group lasso (SGL). We illustrate the resulting causal SNP sparsity pattern in 
Figure 1, and compare it to that generated by the group lasso (GL), a group-sparse model 
that we used previously in a sparse regression method to identify gene pathways [66, 67]. 

Qi G2 Gs Gl 



group lasso 



12 3 



sparse group 
lasso 



Figure 1: Sparsity patterns enforced by the group lasso and sparse group lasso. The set 5 C {1, . . . , P} 
of causal SNPs influencing the phenotype are represented by boxes that are shaded grey. Causal SNPs 
are assumed to occur within a set C C {1, . . . , L} of causal pathways, ^1, . . . , Gl- Here C = {2, 3}. 
The group lasso enforces sparsity at the group or pathway level only, whereas the sparse group lasso 
additionally enforces sparsity at the SNP level. 



With the SGL [69], sparse estimates for the SNP coefficient vector, /3 are given by 

/3^^^ = argmin {-||y - Xf3\\l + (1 - a)Xj2^i\\f3i\\2 + aX\\f3\\i} (1) 



1=1 



where A (A > 0) and a {\a\ < 1) are parameters controlling sparsity, and wi is a pathway 
weighting parameter that may vary across pathways. (1) corresponds to an ordinary least 
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squares (OLS) optimisation, but with two additional constraints on the coefficient vector, /3, 
that tend to shrink the size of /3, relative to OLS estimates. One constraint imposes a group 
lasso- type penalty on the size (£2 norm) of /3^, / = 1, . . . , L. Depending on the values of A, a 
and wi, this penalty has the effect of setting multiple pathway SNP coefficient vectors, /3i = 0, 
thereby enforcing sparsity at the pathway level. Pathways with non-zero coefficient vectors 
form the set C of 'selected' pathways, so that 

C(A,a) = 0: A^O}. 

A second constraint imposes a lasso-type penalty on the size {£1 norm) of /3. Depending on 
the values of A and a, for a selected pathway / G C, this penalty has the effect of setting 
multiple SNP coefficient vectors, = 0, j C 5^, thereby enforcing sparsity at the SNP level 
within selected pathways. SNPs with non-zero coefficient vectors then form the set 5/ of 
selected SNPs in pathway /, so that 

Si{X,a)^{j:Pjy^O,jegi}. 

The set of all selected SNPs is given by 

The sparsity parameter A controls the degree of sparsity in /3, such that the number of 
pathways and SNPs selected by the model increases as A is reduced from a maximal value 
^max, above which /3 = 0. The parameter a controls how the sparsity constraint is distributed 
between the two penalties. When a = 0, (1) reduces to the group lasso, so that sparsity is 
imposed only at the pathway level, and all SNPs within a selected pathway have non-zero 
coefficients. When < a < 1, solutions exhibit dual-level sparsity, such that as a approaches 
from above, greater sparsity at the group level is encouraged over sparsity at the SNP level. 
When a = 1, (1) reverts to the lasso, so that pathway information is ignored. 

Model estimation 

For the estimation of /3^^^ we proceed by noting that the optimisation (1) is convex, and 
(in the case of non-overlapping groups) that the penalty is block-separable, so that we can 
obtain a solution using block, or group- wise coordinate gradient descent (BCGD) [82]. A 
detailed derivation of the estimation algorithm is given in the accompanying supplementary 
information. Section A. 

From (17) and (18), the criterion for selecting a pathway / is given by 

\\S{X'iri,aX)\\2>{l-a)Xwi, (2) 

and the criterion for selecting SNP j in selected pathway / by 

\\X'jri,j\\i > a\ (3) 

where f ^ = f ^ - Y^my^l ^l^l ^IJ ^ ~ ^k^j ^k^k are respectively the pathway and 
SNP partial residuals, obtained by regressing out the current estimated effects of all other 
pathways and SNPs respectively. The complete algorithm for SGL estimation using BCGD 
is presented in Box 1. 
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Box 1 SGL-BCGD estimation algorithm 



1. initialise /3 ^ 0. 

2. repeat: [pathway loop] 

for pathway / = 1, 2, . . . , L: 

if ||5(X^r/,aA)||2 < {l-a)Xwi 

else 

repeat: [SNP loop] 
for j = /i, . . . 

if /3j = : 

Newton update 

else: 

Newton update 



if f{f3l 



> 



I3j using (24) and (22) 
I3j using (21) and (22) 



/3, ^ 

until convergence of fBi [SNP loop] 
until convergence of /3 [pathway loop] 



3. /3 



SGL 



SGL simulation study 1 

We test the hypothesis that where causal SNPs are enriched in a given pathway, pathway- 
driven SNP selection using SGL will outperform simple lasso selection that disregards pathway 
information in a simple simulation study. We simulate P = 2500 genetic markers for N = 400 
individuals. Marker frequencies for each SNP are sampled independently from a multinomial 
distribution following a Hardy Weinberg equilibrium frequency distribution. SNP minor allele 
frequencies are sampled from a uniform distribution U[0.1, 0.5]. SNPs are distributed equally 
between 50 non-overlapping pathways, each containing 50 SNPs. 

We then test each competing method over 500 Monte Carlo (MC) simulations. At each 
simulation, a baseline univariate phenotype is sampled from A/'(10, 1). To generate genetic 
effects, we randomly select 5 SNPs from a single, randomly selected pathway Gi, to form the 
set 5 C ^/ of causal SNPs. Genetic effects are then generated as described in Section C. 

To enable a fair comparison between the two methods (SGL and lasso), we ensure that 
both methods select the same number of SNPs at each simulation. We do this by first 
obtaining the SGL solution, 5*^^^, with A = 0.85Xmax and a = 0.8, which ensures sparsity 
at both the pathway and SNP level. We use a uniform pathway weighting vector w = 1. We 
then compute the lasso solution using coordinate descent over a range of values for the lasso 
regularisation penalty. A, and choose the set 

5^^^^^(y) such that |5^^^^^(y)| = 

where 15*^^^ | is the number of SNPs previously selected by SGL, and |5^^^^^(A')| is the number 
of SNPs selected by the lasso with A = A^ We measure performance as the mean power to 
detect all 5 causal SNPs over 500 MC simulations, and test a range of genetic effect sizes 
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SNPs drawn from single pathway 



SNPs drawn at random 




0.04 0.06 

effect size, 7 



(a) 



0.04 0.06 

effect size, 7 



(b) 



Figure 2: SGL vs Lasso. Comparison of power to detect 5 causal SNPs. Each data point represents 
mean power over 500 MC simulations. Left: Causal SNPs drawn from single causal pathway. Right: 
Causal SNPs drawn at random. 



(7) (see C). In a follow up study, we compare the performance of the two methods in a 
scenario in which pathways information is uninformative. For this we repeat the previous 
simulations, but with 5 causal SNPs drawn at random from all 2500 SNPs, irrespective of 
pathway membership. Results are presented in Figure 2. 

Referring to Figure 2, we see that where causal SNPs are concentrated in a single causal 
pathway (Figure 2 - left), SGL demonstrates greater power (and equivalently specificity, since 
the total number of selected SNPs is constant), compared with the lasso, above a particular 
effect size threshold (here 7 ^ 0.04). Where pathway information is not important, that is 
causal SNPs are not enriched in any particular pathway (Figure 2 - right), SGL performs 
poorly. 

To gain a deeper understanding of what is happening here, we also consider the power 
distributions across all 500 MC simulations corresponding to each point in the plots of Figure 
2. These are illustrated in Figure 3. The top row of plots illustrates the case where causal 
SNPs are drawn from a single causal pathway. Here we see that there is a marked difference 
between the two distributions (SGL vs lasso). The lasso shows a smooth distribution in 
power, with mean power increasing with effect size. In contrast, with SGL the distribution is 
almost bimodal, with power typically either or 1, depending on whether or not the correct 
causal pathway is selected. This serves as an illustration of the advantage of pathway-driven 
SNP selection for the detection of causal SNPs in the case that pathways are important. As 
previously found by Zhou et al. [91] in the context of rare variants and gene selection, the 
joint modelling of SNPs within groups gives rise to a relaxation of the penalty on individual 
SNPs within selected groups, relative to the lasso. This can enable the detection of SNPs 
with small effect size or low MAF that are missed by the lasso, which disregards pathways 
information and treats all SNPs equally. Finally, where causal SNPs are not enriched in a 
causal pathway (bottom row of Figure 3), as expected SGL performs poorly. In this case SGL 
will only select a SNP where the combined effects of constituent SNPs in a pathway are large 
enough to drive pathway selection. 
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Figure 3: SGL vs Lasso. Distribution over 500 MC simulations of power to detect 5 causal SNPs. 
Each plot represents the power distribution at a single data point in Figure 2. The power distribution 
is discrete, since each method can identify 0,1,2,3,4 or 5 causal SNPs, with corresponding power 
0,0.2,0.4,0.6,0.8 or 1.0. Top row: Causal SNPs drawn from single causal pathway. Bottom row: 
Causal SNPs drawn at random. 



The problem of overlapping pathways 

The assumption that pathways are disjoint does not hold in practice, since genes and SNPs 
may map to multiple pathways (see Figure 7). This means that typically QiHGu ^ for some 
I ^ V . In the context of pathways-driven SNP selection using SGL, this has two important 
implications. Firstly, the optimisation (1) is no longer separable into groups (pathways), so 
that convergence using coordinate descent is no longer guaranteed [82] . Secondly, we wish to 
be able to select pathways independently, and the SGL model as previously described does 
not allow this . For example consider the case of an overlapping gene, that is a gene that 
maps to more than one pathway. If a SNP mapping to this gene is selected in one pathway, 
then it must be selected in each and every pathway containing the mapped gene, so that all 
pathways mapping to the gene are selected. We instead want to admit the possibility that 
the joint SNP effects in one pathway may be sufficient to allow pathway selection, while the 
joint effects in another pathway containing some of the same SNPs do not pass the threshold 
for pathway selection. 

A solution to both these problems is obtained by duplicating SNP predictors in X, so that 
SNPs belonging to more than one pathway can enter the model separately [39, 66]. The process 
works as follows. An expanded design matrix is formed from the column- wise concatenation of 
the L, [N X Pi) sub-matrices, X/, to form the expanded design matrix X* = [Xi, X2, . . . , X^,] 
of size {N X P*), where P* = P/. The corresponding P* x 1 parameter vector, /3*, is formed 
by joining the L, {Pi x 1) pathway parameter vectors, , so that ^* = [/3|^^2^ • • • ^^lT- 
Pathway mappings with SNP indices in the expanded variable space are reflected in updated 



8 



_JL_ 




Figure 4: Two pathways with partially overlapping causal SNPs. Causal SNPs (marked in grey) in 
the set Sk overlap both pathways, so that Sk = Gk ^Qi- Additional causal SNPs, Si fl \5/c, (marked 
in purple) occur in pathway / only. 



groups Ql^ . . . , The SGL estimator (1), adapted to account for overlapping groups, is then 
given by 

/3^^^* = argmin{-||y-X*/3*||2 + (l-a)A5^^ni/3ri|2 + aA||/3^ (4) 
^ 1=1 

With this overlap expansion, the model is then able to perform pathway and SNP selection in 
the way that we require, and the corresponding optimisation problem is amenable to solution 
using the BCGD estimation algorithm described in Box 1. However, for the purpose of 
pathways-driven SNP selection, the application of this algorithm presents a problem. This 
arises from the replication of overlapping SNP predictors in each group, X^*, that they occur. 

Consider for example the simple situation where there are two pathways, ^^*, containing 
sets of causal SNPs SI C ^* and 5f C 5* respectively. Here the * indicates that SNP indices 
refer to the expanded variable space. We begin by assuming that S^ and S^ contain the same 
SNPs, so that in the unexpanded variable space, Sk = 5/- 

We then proceed with BCGD by first estimating We assume that the correct SNPs 
are selected, so that {^S* 7^ : j G 5^}, and ^5* = otherwise. For the estimation of 
/37*, the estimated effect ^a^^* XWj, of these overlapping causal SNPs is removed from the 
regression, through its incorporation in the block residual f ,* = y — X^.-^^* XW"^. Since no 

•J k J J 

other causal SNPs exist in pathway Q^^ X^* f^* = 0, so that the criterion for pathway selection, 
\\S{X.fY\,aX)\\2 > (1 - (y)\wi (2) is not met. That is is not selected. 

Now consider the case where additional, non-overlapping causal SNPs, possibly with 
smaller effects, occur in ^^*, so that in the unexpanded variable space, Sk C Si. In other 
words, causal SNPs are partially overlapping (see Fig 4)^. During BCGD pathway is then 
less likely to be selected by the model, than would be the case if there were no overlapping 
SNPs, since once again the effects of overlapping causal SNPs, SkDSi = 5/e, are removed. 

For pathways-driven SNP selection, we will argue that we instead require that SNPs are 
selected in each and every pathway whose joint SNP effects pass a revised pathway selec- 
tion threshold, irrespective of overlaps between pathways. This is equivalent to the previous 
pathway selection criterion (2), but with the additional assumption that pathways are inde- 

^This is the situation for example where multiple causal genes overlap both pathways, but one or more 
additional causal genes occur in Qi. 
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pendent, in the sense that they do not compete in the model estimation process. We describe 
a revised estimation algorithm under the assumption of pathway independence below. 

We justify the strong assumption of pathway independence with the following argument. 
In reality, we expect that multiple pathways may simultaneously influence the phenotype, and 
we also expect that many such pathways will overlap, for example through their containing one 
or more 'hub' genes, that overlap multiple pathways [46, 48]. By considering each pathway 
independently, we aim to maximise the sensitivity of our method to detect these variants 
and pathways. In contrast, without the independence assumption, a competitive estimation 
algorithm will tend to pick out one from each set of similar, overlapping pathways, and 
miss potentially causal pathways and variants as a consequence. We illustrate this idea in 
the simulation study in the following section. One potential concern is that by not allowing 
pathways to compete against each other, specificity may be reduced, since too many pathways 
and SNPs may be selected. We discuss the issue of specificity further in the context of results 
from the simulation study. 

A detailed derivation of the SGL model estimation algorithm under the independence 
assumption is given in supplementary information. Section B. The main results are that the 
pathway (2) and SNP (3) selection criteria become 

||S'(X^y,aA)||2 > (1 - a)Xwi, and 

||Xjy||i >aA (5) 

respectively. The key difference is that partial derivatives and rij are replaced by y, that 
is each pathway is regressed against the phenotype vector y. This means that there is no 
block coordinate descent stage in the estimation, so that the revised algorithm utilises only 
coordinate gradient descent within each selected pathway. For this reason we use the acronym 
SGL-CGD for the revised algorithm, and SGL-BCGD for the previous algorithm using block 
coordinate gradient descent. The new algorithm is described in Box 2. 

Finally, we note that for SNP selection we are interested only in the set S of selected 
SNPs in the unexpanded variable space, and not the set 5* = {j* : /3* ^ 0, j* G {1, . . . , P*}}. 
Since, under the independence assumption, the estimation of each /3^* does not depend on the 
other estimates, /3^,fc 7^ /, we do not need to record separate coefficient estimates for each 
pathway in which a SNP is selected. Instead we need only record the set 5^, / G C of SNPs 
selected in each selected pathway. This has a useful practical implication, since we can avoid 
the need for an expansion of X or /3, and simply form the complete set of selected SNPs as 

S = [jSi. 

leC 

SGL simulation study 2 

We now explore some of the issues raised in the preceding section, specifically the potential 
impact on pathway and SNP selection power and specificity of treating the pathways as 
independent in the SGL estimation algorithm. We do this in a simulation study in which we 
simulate overlapping pathways. The simulation scheme is specifically designed to highlight 
differences in pathway and SNP selection with the independence assumption (using the SGL- 
CGD estimation algorithm in Box 2) and without it (using the standard SGL estimation 
algorithm in Box 1). 
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Box 2 SGL-CGD estimation algorithm for overlapping pathways 



1. initialise /3* ^ 0. 

2. for pathway / = 1, 2, . . . , L: 

if ||5(Xfy,aA)||2 < {l-a)Xwi 
$1 ^ 

else 

repeat: [CGD (SNP) loop] 
for j = . . . 

if ^* = : 



Newton update ^5** ^ /3* using (31) and (22) 



else: 



Newton update ^5** ^ /3* using (30) and (22) 
if/(A**)>J(A*)^ 



until convergence 

3. ^ f3'' 



SNPs with variable MAP are simulated using the same procedure described in the previous 
simulation study, but this time SNPs are mapped to 50 overlapping pathways, each containing 
30 SNPs. Each pathway overlaps any adjacent (by pathway index) pathway by 10 SNPs. This 
overlap scheme is illustrated in Figure 5 (a). 

As before we consider a range of overall genetic effect sizes, 7. A total of 2000 MC 
simulations are conducted for each effect size. At MC simulation z, we randomly select two 
adjacent pathways, GhGi+i where / G {1,...,49}. From these two pathways we randomly 
select 10 SNPs according to the scheme illustrated in Figure 5 (b). This ensures that causal 
SNPs overlap a minimum of 1, and a maximum of 2 pathways, with Sz C {Gi H \Gi-i) U 
{Gi+i n \Gi^2)- The true set of causal pathways, C, is then given by {/}, {/ + 1} or {/, / + 1} 
(although simulations where |C| = 1 will be extremely rare). Genetic effects on the phenotype 
are generated as described previously (Section C). 

SNP coefficients are estimated for each algorithm, SGL-BCGD and SGL-CGD, using the 
same regularisation with A = 0.85Xmax and a = 0.85 for both. 

The average number of pathways and SNPs selected by SGL-BCGD and SGL-CGD across 
all 2000 MC simulations is reported in Table 1. As expected, for both models, the number of 
selected variables (pathways or SNPs) increases with decreasing effect size, as the number of 
pathways close to the selection threshold set by Xmax increases. 

For each model, at MC simulation z we record the pathway and SNP selection power, 
\Cz r]Cz\/\Cz\ and 15^ nSz\/\Sz\ respectively. Since the number of selected variables can vary 
slightly between the two models, we also record false positive rates (FPR) for pathway and 
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Gi 

(b) 

Figure 5: SGL Simulation Study with overlapping pathways, (a): Illustration of pathway overlap 
scheme. The are 30 SNPs in each pathway. Pathways Quil = 1, • • • , 50) overlap each adjacent pathway 
by 10 SNPs. (b): Causal SNPs from adjacent pathways, / + 1 are randomly selected from the region 
marked in purple, ensuring that SNPs in S overlap a maximum of two pathways. 



Table 1: Simulation Study 2: Mean number of pathways and SNPs selected by each model at each 
effect size, 7, across 2000 MC simulations. 





7 

0.02 0.04 0.06 0.08 0.1 0.12 


pathways SGL-CGD 
SGL-BCGD 


5.8 5.9 5.4 4.8 3.9 3.2 
5.8 5.9 5.4 4.8 3.9 3.2 


SNPs SGL-CGD 
SGL-BCGD 


26.6 27.0 24.8 22.2 18.5 15.3 
28.8 29.3 26.7 23.6 19.4 15.8 
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SNP selection as |4 H \C^|/|4| and H respectively. 

The large possible variation in causal SNP distributions, causal SNP MAFs etc. make a 
comparison of mean power and FPR between the two methods somewhat unsatisfactory. For 
example, depending on effect size, a large number of simulations can have either very high, 
or very low pathway and SNP selection power, masking subtle differences in performance 
between the two methods. Since we are specifically interested in establishing the relative 
performance of the two methods, we instead illustrate the number of simulations at which 
one method outperforms the other across all 2000 MC simulations, and show this in Figure 
6. In this figure, the number of simulations in which SGL-CGD outperforms SGL, i.e. where 
SGL-CGD power > SGL-BCGD power, or SGL-CGD FPR < SGL-BCGD FPR, are shown 
in green. Conversely, the number of simulations where SGL-BCGD outperforms SGL-CGD 
are shown in red. 



Pathway selection power 

■ SGL-CGD > SGL-BCGD 

■ SGL-CGD < SGL-BCGD 



0.1 0.12 



Pathway selection FPR 
I SGL-CGD < SGL-BCGD 
I SGL-CGD > SGL-BCGD 



Q 



0.1 0.12 




Figure 6: SGL-CGD vs SGL-BCGD performance, measured across 2000 MC simulations. Top row: 
Pathway selection performance. (Left) green bars indicate the number of MC simulations where SGL- 
CGD has greater pathway selection power than SGL. Red bars indicate where SGL-BCGD has greater 
power than SGL-CGD. (Right) green bars indicate the number of MC simulations where SGL-CGD 
has a lower FPR than SGL. Red bars indicate the opposite. Bottom row: As above, but for SNP 
selection performance. 
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We first consider patliway selection performance (top row of Figure 6). For both methods, 
the same number of pathways are selected on average, across all effect sizes (Table 1). At 
low effect sizes, there is no difference in performance between the two methods for the large 
majority of MC simulations, and where there is a difference, the two methods are evenly 
balanced. As with SGL Simulation Study 1, this is the region (with 7 < 0.04) where pathway 
selection fairs no better than chance. With 7 > 0.04, SGL-CGD consistently outperforms 
SGL, both in terms of pathway selection sensitivity and control of false positives (measured 
by FPR). 

To understand why, we turn to SNP selection performance (bottom row of Figure 6). At 
small effect sizes (7 < 0.04), in the small minority of simulations where the correct pathways 
are identified, SGL-BCGD tends to demonstrate greater power than SGL-CGD (Figure 6 
bottom left). However, this is at the expense of lower specificity (Figure 6 bottom right). 
These difference are due to the slightly larger number of SNPs selected by SGL-BCGD (see 
Table 1), which in turn is due to the 'screening out' of previously selected SNPs from the 
adjacent causal pathway during BCGD, as described previously. This results in the selection 
of a larger number of SNPs when any two overlapping pathways are selected by the model. 
In the case where two causal pathways are selected, SNP selection power is then likely to be 
higher, although at the expense of a greater number of false positives. 

When pathway effects are just on the margin of detectability (7 = 0.06), SGL-CGD is more 
often able to select both causal pathways, although this doesn't translate into increased SNP 
selection power. This is most likely because at this effect size neither model can detect SNPs 
with low MAF, so that SGL-CGD is detecting the same (overlapping) SNPs in both causal 
pathways. Note that once again SGL-BCGD typically has a higher FPR than SGL-CGD, 
since more SNPs are selected from non-causal pathways. 

As the effect size increases, the number of simulations in which SGL-CGD outperforms 
SGL-BCGD for SNP selection power grows, paralleling the former method's enhanced path- 
way selection power. This is again a demonstration of the screening effect with SGL-BCGD 
described previously. This means that SGL-CGD is more often able to select both causal 
pathways, and to select additional causal SNPs that are missed by SGL. These additional 
SNPs are likely to be those with lower MAF, for example, that are harder to detect with 
SGL, once the effect of overlapping SNPs are screened out during estimation using BCGD. 
Interestingly, as before SGL-CGD continues to exhibit lower false positive rates than SGL. 
This suggests that, with the simulated data considered here, the independence assumption 
offers better control of false positives by enabling the selection of causal SNPs in each and 
every pathway to which they are mapped. In contrast, where causal SNPs are successively 
screened out during the estimation using BCGD, too many SNPs with spurious effects are 
selected. 

The relative advantage of SGL-CGD over SGL-BCGD on all performance measures starts 
to decrease around 7 = 0.1, as SGL-BCGD becomes better able to detect all causal pathways 
and SNPs, irrespective of the screening effect. 

Pathway and SNP selection bias 

One issue that must be addressed is the problem of selection bias, by which we mean the 
tendency of SGL to favour the selection of particular pathways or SNPs under the null, where 
no SNPs influence the phenotype. Possible biasing factors include variations in pathway size, 
that is the number of SNPs mapping to a pathway, or varying patterns of SNP-SNP correla- 
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tions and gene sizes. Common strategies for bias reduction include the use of dimensionality 
reduction techniques and permutation methods [87, 35, 89, 16]. 

In earlier work we described an adaptive weight-tuning strategy, designed to reduce se- 
lection bias in a group lasso-based pathway selection method [66] . This works by tuning the 
pathway weight vector, w = (wi^W2^-..^wl)^ so as to ensure that pathways are selected 
with equal probability under the null. This strategy can be readily extended to the case of 
dual-level sparsity with the SGL. 

Our procedure rests on the observation that for pathway selection to be unbiased, each 
pathway must have an equal chance of being selected. For a given a, and with A tuned to 
ensure that a single pathway is selected, pathway selection probabilities are then described 
by a uniform distribution, 11/ = 1/L, for / = 1, . . . , L. We proceed by calculating an empirical 
pathway selection frequency distribution, n*(w), by determining which pathway will first be 
selected by the model as A is reduced from its maximal value, Xmaxi over multiple permutations 
of the response, y. This process is described in detail in Supplementary Information D. 

Our iterative weight tuning procedure then works by applying successive adjustments to 
the pathway weight vector, so as to reduce the difference, di = n^*(it;) — 11^, between 
the unbiased and empirical (biased) distributions for each pathway. At iteration r, we com- 
pute the empirical pathway selection probability distribution 11* (w^'^)), determine di for each 
pathway, and then apply the following weight adjustment 

wl^^^^ = wl^^ [1 - sign{di){r] - l)L^df] < ry < 1, / = 1, . . . , L. 

The parameter 77 controls the maximum amount by which each wi can be reduced in a single 
iteration, in the case that pathway / is selected with zero frequency. The square in the weight 
adjustment factor ensures that large values of \di \ result in relatively large adjustments to wi. 
Iterations continue until convergence, where X^^i < ^• 

Note that when multiple pathways are selected by the model, the expected pathway se- 
lection frequency distribution under the null will not be uniform. This is because pathways 
overlap, so that selection frequencies will reflect the complex distribution of overlapping genes, 
as indeed will unbiased empirical selection frequencies. We have shown previously in exten- 
sive simulations that this adaptive weight-tuning procedure gives rise to substantial gains in 
sensitivity and specificity with regard to pathway selection [66]. 

Pathway, SNP and gene ranking 

With most variable selection methods, a choice for the regularisation parameter. A, must be 
made, since this determines the number of variables selected by the model. Common strategies 
include the use of cross validation to choose a A value that minimises the prediction error 
between training and test datasets [31]. One drawback of this approach is that it focuses on 
optimising the size of the set, C, of selected pathways (more generally, selected variables) that 
minimises the cross validated prediction error. Since the variables in C will vary across each 
fold of the cross validation, this procedure is not in general a good means of establishing the 
importance of a unique set of variables, and can give rise to the selection of too many variables 
[85, 54]. For the lasso, alternative approaches, based on data subsampling or bootstrapping 
have been shown to improve model consistency, in the sense that the correct model is selected 
with a high probability [4, 54, 13]. These methods work by recording selected variables across 
multiple subsamples of the data, and forming the final set of selected variables either as 
the intersection of variables selected at each model fit, or by assessing variable selection 
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frequencies. Examples of the use of such approaches can be found in a number of recent gene 
mapping studies involving model selection using either the lasso or elastic net [17, 21, 56, 
85]. Motivated by these ideas, we adopt a resampling strategy in which we calculate pathway, 
gene and SNP selection frequencies by repeatedly fitting the model over B subsamples of 
the data, at fixed values for a and A. Each random subsample of size N/2 is drawn without 
replacement. Our motivation here is to exploit knowledge of finite sample variability obtained 
by subsampling, to achieve better estimates of a variable's importance. With this approach, 
which in some respects resembles the 'pointwise stability selection' strategy of Meinshausen 
and Biihlmann [54], selection frequencies provide a direct measure of confidence in the selected 
pathways in a finite sample. This resampling strategy also allows us to rank pathways, genes 
and SNPs in order of their strength of association with the phenotype, so that we expect the 
true set of causal variables to achieve a high ranking, whereas non-causal variables will be 
ranked low. 

For pathway ranking, we denote the set of selected pathways at subsample b by 

CC) = : /3f ) ^ 0} b=l,...,B, 

where /3|^^^ is the estimated SNP coefficient vector for pathway / at subsample b. The selection 
probability for pathway / measured across all B subsamples is then 

b=i 

where the indicator function, I^^^ = 1 if / G C^^\ and otherwise. Pathways are ranked in 
order of their selection probabilities, nf^^^ nf^^^. 

For SNP and gene ranking, we denote the set of SNPs selected at subsample b (in the 
unexpanded variable space) by S^^\ and further denote the set of selected genes to which 
the SNPs in 5^^) are mapped by 0^^^ C $, where $ = {1, . . . , G} is the set of gene indices 
corresponding to all G mapped genes. Using the same strategy as for pathway ranking, we 
obtain an expression for the selection probability of SNP j across B subsamples as 

SNP _ jL V T^^^ 
^3 ~ B ^ ^ 

b=l 

where the indicator function, jj^^ = 1 if j G S^^\ and otherwise. A similar expression for 
the selection probability for gene g is 

1 ^ 

gene ^ ±_ ST^ T^{b) 
9 B 9 

b=l 

where the indicator function, Kg^^ = 1 if ^ G 0^^^, and otherwise. SNPs and genes are then 
ranked in order of their respective selection frequencies. 

Results 

Subjects, genotypes and phenotypes 

The analysis is carried out using data from two separate cohorts of Asian adults. These 
datasets have previously been used to search for novel variants associated with type 2 dia- 
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betes mellitus (T2D) in Asian populations. The first (discovery) cohort is from the Singa- 
pore Prospective Study Program, hereafter referred to as 'SP2', and the second (rephcation) 
dataset is from the Singapore Malay Eye Study or 'SiMES'. Detailed information on both 
datasets can be found in Sim et al. [68], but we briefly outline some salient features here. 

Both datasets comprise whole genome data for T2D cases and controls, genotyped on 
the Illumina HumanHap 610 Quad array. For the present study we use controls only, since 
variation in lipid levels between cases and controls can be greater than the variation within 
controls alone. The use of both cases and controls in our analysis might then lead to a 
confounded analysis, where any associations could be linked to T2D status or some other 
spurious factor. 

The SP2 dataset consists entirely of ethnic Chinese, and shows no evidence of population 
stratification. The SiMES dataset comprises ethnic Malays, and shows some evidence of 
cryptic relatedness between samples. For this reason, the first two principal components 
of a PC A for population structure are used as covariates in our analysis of this dataset. 
Again full details of the stratification analysis can be found in Sim et al. [68] and associated 
supplementary information. 

A summary of information pertaining to genotypes for each dataset, both before and after 
imputation and pathway mapping, is given in Table 2, along with a list of phenotypes and 
covariates. 

Table 2: Genotype and phenotype information corresponding to the SP2 and SiMES datasets used in 
the study. 



SP2 Simes 

Sample size AT = 1, 040 A/" = 1, 099 

Genotypes 

Before imputation 

SNPs available for analysis^^) 542, 297 557, 824 

SNPs with missing genotypes^^) 152, 372 282, 549 

Post imputation 

SNPs available for analysis^^) 492, 639 515, 503 

Phenotypes / covariates 

quantitative trait (phenotype) HDLC HDLC 

covariates gender, age, age^, gender, age, age^, 

BMI(^) BMI, PCI, PC2(6) 

after first round of quality control [68] and removal of monomorphic SNPs 
^^Waximum 5% missing rate per SNP 

after imputation and removal of SNPs with MAF< 0.01 
^4)mg/dL 

^^^body mass index (kg/m^) 

principal components relating to cryptic relatedness 
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Genotype imputation 

After the initial round of quality control, genotypes for both datasets have a maximum SNP 
missingness of 5%. Since our method cannot handle missing values, we perform 'missing 
holes' SNP imputation, so that all missing SNP calls are estimated against a reference panel 
of known haplotypes. 

SNP imputation proceeds in two stages. First, imputation requires accurate estima- 
tion of haplotypes from diploid genotypes (phasing). This is performed using SHAPEIT vl 
(http://www.shapeit.fr). This uses a hidden Markov model to infer haplotypes from sample 
genotypes using a map of known recombination rates across the genome [18]. The recombi- 
nation map must correspond to genotype coordinates in the dataset to be imputed, so we 
use recombination data from HapMap phase II, corresponding to genome build NCBI b36 
(http:/ /hapmap. ncbi.nlm.nih.gov/downloads/recombination/2008-03_rel22_B36/). 

Following the primary phasing stage, SNP imputation is performed using IMPUTE v2.2.2 
(http://mathgen.stats.ox.ac.uk/impute/impute_v2.html). IMPUTE uses a reference panel 
of known haplotypes to infer unobserved genotypes, given a set of observed sample haplo- 
types [37]. The latest version (IMPUTE 2) uses an updated, efficient algorithm, so that a 
custom reference panel can be used for each study haplotype, and for each region of the 
genome, enabling the full range of reference information provided by HapMap3 [77] to be 
used. Following IMPUTE 2 guidelines, we use HapMap3 reference data corresponding to 
NCBI b36 (http://mathgen.stats.ox.ac.uk/impute/data_download_hapmap3_r2.html) which 
includes haplotype data for 1,011 individuals from Africa, Asia, Europe and the Americas. 
SNPs are imputed in 5MB chunks, using an effective population size {Ne) of 15,000, and a 
buffer of 250kb to avoid edge effects, again as recommended for IMPUTE 2. 

The phasing and imputation process is complex and computationally intensive. For this 
reason we implement a pipeline in Python, with phasing and imputation for each chromosome 
conducted in parallel across multiple nodes in a computing cluster. This enables full genome 
imputation that would otherwise take days, to be completed in a matter of hours. 

Pathway mapping 

Pathways GWAS methods rely on prior information mapping SNPs to functional networks 
or pathways. Since pathways are typically defined as groups of interacting genes, SNP to 
pathway mapping is a two-part process, requiring the mapping of genes to pathways, and 
of SNPs to genes. A consistent strategy for this mapping process has however yet to be 
established, a situation compounded by a lack of agreement on what constitutes a pathway 
in the first place [10]. 

The number and size of databases devoted to classifying genes into pathways is grow- 
ing rapidly, as is the range and diversity of gene interactions considered (see for example 
http://www.pathguide.org/). Databases such as those provided by KEGG (http://www. 
genome.jp/kegg/pathway.html), Reactome (http://www.reactome.org/) and Biocarta (http: 
//www. biocarta.com/) classify pathways across a number of functional domains, for exam- 
ple apoptosis, cell adhesion or lipid metabolism; or crystallise current knowledge on specific 
disease-related molecular reaction networks. Strategies for pathways database assembly range 
from a fully-automated text-mining approach, to that of careful curation by experts. In- 
evitably therefore, there is considerable variation between databases, in terms of both gene 
coverage and consistency [71], so that the choice of database(s) will itself influence results in 
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pathways GWAS. 

The mapping of SNPs to genes adds a further layer of complexity, since although many 
SNPs may occur within gene boundaries, on a typical GWAS array the vast majority of 
SNPs will reside in inter-genic regions. In an attempt to include variants potentially residing 
in functionally significant regions lying outside gene boundaries, SNPs may be mapped to 
nearby genes using various distance thresholds. Various values for SNP to gene mapping 
distances, measured in thousands of nucleotide base pairs (kb), have been suggested in the 
literature, ranging from mapping SNPs to genes only if they fall within a specific gene, to the 
attempt to encompass upstream promoters and enhancers by extending the range to 10, 20 
or even 500kb and beyond [87, 20, 10]. This process is illustrated schematically in Figure 7. 
Notable features of the SNP to pathway mapping process include the fact that genes (and 
therefore SNPs) may map to more than one pathway, and also that many SNPs and genes do 
not currently map to any known pathway [25]. 



pathways 



genes 



genotyped SNPs 




Figure 7: Schematic illustration of the SNP to pathway mapping process, (i) Genes (green circles) 
are mapped to pathways using information on gene-gene interactions (top row), obtained from a gene 
pathways database. Many genes do not map to any known pathway (unfilled circles). Also, some 
genes may map to more than one pathway, (ii) Genes that map to a pathway are in turn mapped 
to genotyped SNPs within a specified distance. Many SNPs cannot be mapped to a pathway since 
they do not map to a mapped gene (unfilled squares). Note SNPs may map to more than one gene. 
Some SNPs (orange squares) may map to more than one pathway, either because they map to multiple 
genes belonging to different pathways, or because they map to a single gene that belongs to multiple 
pathways. 

Following imputation, SNPs for both datasets in the present study are mapped to KEGG 
canonical pathways from the MSigDB database (http:/ /www. broadinstitute.org/gsea/msigdb/ 
index.jsp). We exclude the largest KEGG pathway (by number of mapped SNPs), 'Pathways 
in Cancer', since it is highly redundant in that it contains multiple other pathways as subsets. 
Details of the pathway mapping process are given in Figures 8 and 9. 

Note that there is a difference in the number of SNPs available for the pathway mapping 
between the two datasets, and this results in a small discrepancy in the total number of 
mapped genes (SP2: 4,734 mapped genes; SiMES: 4,751). However, both datasets map to all 
185 KEGG pathways, and a large majority of mapped genes and SNPs overlap both datasets. 
Detailed information on the pathway mapping process for the two datasets is presented in 
Table 3. 
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We perform pathways-driven SNP selection on both datasets, using the procedures described 
in Methods. We present results for each dataset separately below. 



Genes: GRCH36/hgl8 
21,004 genes 



SNP to gene mapping 




Pathways: KEGG 
185 Pathways containing 
5,267 distinct genes 



75,389 SNPs mapped to 4,734 genes and 185 pathways 



Figure 8: SP2 dataset SNP to pathway mapping. 



SNP to gene mapping 




Figure 9: SiMES dataset SNP to pathway mapping. 
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Table 3: Comparison of SNP and gene to pathway mappings for the SP2 and SiMES datasets. 





SP2 


SiMES 


Total SNPs mapping to pathways 


75,389 


78,933 


Total SNPs mapping to pathways in both datasets (intersection) 


74,864 


Total mapped genes 


4,734 


4,751 


Total genes mapping to pathways in both datasets (intersection) 


4,726 


Total mapped pathways 


185 


185 


Minimum number of genes mapping to single pathway 


11 


11 


Maximum number of genes mapping to single pathway 


63 


63 


Minimum number of SNPs mapping to single pathway 


66 


67 


Maximum number of SNPs mapping to single pathway 


5,759 


6,058 


Minimum number of pathways mapping to a single SNP 


1 


1 


Maximum number of pathways mapping to a single SNP 


45 


45 



SP2 Analysis 

For the SP2 dataset we consider two separate scenarios for the regularisation parameters A 
and a. For the two scenarios we set the sparsity parameter, A = O-QSA^^a;) but consider 
two values for a, namely a = 0.95,0.85. We test each scenario over 1000 N/2 subsamples. 
We also compare the resulting pathway and SNP selection frequency distributions with null 
distributions, again over 1000 N/2 subsamples, but with phenotype labels permuted, so that 
no SNPs can influence the phenotype. 

The parameter a controls how the regularisation penalty is distributed between the ^2 
(pathway) and ^1 (SNP) norms of the coefficient vector. Each scenario therefore entails 
different numbers of selected pathways and SNPs, and this information is presented in Table 
4. 

Table 4: Separate combinations of regularisation parameters, A and a used for analysis of the SP2 
dataset. For each A, a combination, the mean (±SD) number of selected pathways and SNPs across 
all 1000 subsamples is reported. 





a = 


A — 0.95\max 

= 0.85 a = 0.95 


empirical 










selected pathways 


7.9 


± 


6.1 


4.8 ±4.1 


selected SNPs 


1551 


± 


1294 


160 ± 185 


null 










selected pathways 


9.1 


± 


7.2 


5.0 ±4.55 


selected SNPs 


1656 


± 


1401 


155 ± 194 



Comparisons of empirical and null pathway selection frequency distributions for each 
scenario are presented in Figure 10. The same comparisons for SNP selection frequencies are 
presented in Figure 11. In these plots, null distributions (coloured blue) are ordered along 
the X-axis according to their corresponding ranked empirical selection frequencies (marked in 



21 



red). This is to help visuahse any potential biases that may be influencing variable selection 
(see below). 

To interpret these results, we begin by noting from Table 4 that many more SNPs are 
selected with a — 0.85, resulting in higher SNP selection frequencies, compared to those 
obtained with a — 0.95 (see Figure 11). This is as expected, since a lower value for a implies 
a reduced penalty on the SNP coeflicient vector, resulting in more SNPs being selected. 
Perhaps surprisingly, given that the ^2 group penalty (1 — a) A is increased, the number of 
selected pathways is also greater. This must reflect the reduced li penalty, which allows a 
greater number of SNPs to contribute to a putative selected pathway's coeflicient vector. This 
in turn increases the number of pathways that pass the threshold for selection. 

This raises the question of what might be considered to be an optimal choice for the 
regularisat ion-distributional parameter o;, since different assumptions about the number of 
SNPs potentially influencing the phenotype may affect the resulting pathway and SNP rank- 
ings. To answer this, we turn our attention to the pathway and SNP selection frequency 
distributions for each a value in Figures 10 and 11. At the lower value of a = 0.85 (top plots 
in Figures 10 and 11), empirical pathway and SNP selection frequency distributions appear 
to be biased, in the sense that there is a suggestion that pathways and SNPs with the highest 
empirical selection frequencies also tend to be selected with a higher frequency under the null, 
where there is no association between genotype and phenotype. This relationship appears to 
be diminished with a — 0.95, when fewer SNPs are selected by the model. We investigate 
this further by plotting empirical vs. null selection frequencies as a sequence of scatter plots 
in Figure 12, and we report Pearson correlation coefficients and p- values for these in Table 5. 

Table 5: SP2 dataset: Pearson correlation coefficients (r)and p- values for the data plotted in Figure 
12. n denotes the number of predictors considered. For SNPs, coefficients describe correlations for all 
predictors selected with nonzero empirical selection frequencies only, since a large number of SNPs are 
not selected by the model at any subsample. 







a = 0.85 






a = 0.95 






n 


r p- 


-value 


n 


r p- 


value 


pathways 


185 


0.66 1.3 


X 10-24 


185 


0.26 2.9 


X 10-4 


SNPs 


62, 965 


0.37 





30, 027 


0.11 1.2 


X 10-84 



These provide further evidence of increased correlation between empirical and null selec- 
tion frequency distributions at the lower a value for both pathways and SNPs, again sug- 
gesting increased bias in the empirical results, in the sense that certain pathways and SNPs 
tend to be selected with a higher frequency, irrespective of whether or not a true signal may 
be present. Further qualitative evidence of reduced bias with o: = 0.95 is suggested by the 
clearer separation of empirical and null distributions at the higher a value in Figures 10 and 
11. For example, the maximum empirical pathway selection frequency is reduced by a factor 
of 0.29 (0.35 to 0.25) as a is increased from 0.85 to 0.95, whereas the maximum pathway 
selection frequency under the null is reduced by a factor of 0.81 (0.29 to 0.054). Similarly for 
SNPs, the maximum empirical SNP selection frequency is reduced by a factor of 0.37 (0.52 to 
0.33), whereas the maximum SNP selection frequency under the null is reduced by a factor 
of 0.9 (0.11 to 0.011). 

The increased bias with a — 0.85 is most likely due to the selection of too many SNPs, 
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pathway (ranked by empirical selection frequency) 

(a) 



0.5 




pathway (ranked by empirical selection frequency) 
(b) 

Figure 10: Empirical and null pathway selection frequency distributions for all 185 KEGG pathways 
with the SP2 dataset. For each scenario, pathways are ranked along the x-axis in order of their 
empirical pathway selection frequency, tt^^^^ >,...,> Trf^^^. (a) a = 0.85. (b) a = 0.95. 
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Figure 11: Empirical and null SNP selection frequency distributions with the SP2 dataset. For each 
scenario, SNPs are ranked along the x-axis in order of their empirical pathway selection frequency, 
jj-SNP y TTj^^^ > — (a) a = 0.85. (b) a = 0.95. Note fewer SNPs are selected with nonzero empirical 
selection frequency with a = 0.95, so that the x-axis range in (b) is reduced. 
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Figure 12: SP2 dataset: Scatter plots comparing empirical and null selection frequencies presented in 
Figures 10 and 11. (a) and (h): Pathway selection frequencies with a = 0.85, 0.95 respectively, (c) and 
(d): SNP selection frequencies for the same a values. For clarity, SNP selection frequencies are plotted 
for the top 1000 SNPs (by empirical selection frequency) only. Corresponding correlation coefficients 
(for all ranked SNPs) are presented in Table 5. Note that pathway and SNP selection frequencies are 
much higher at the lower a value (left hand plots), since many more variables are selected (see Table 
4.) 
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in the sense that many selected SNPs do not exhibit real phenotypic effects. These extra 
SNPs effectively add noise to the model, in the form of multiple weak, spurious signals. This 
in turn will add bias to the resulting selection frequency distributions, tending to favour, for 
example, SNPs that overlap multiple pathways, and the pathways that contain them. As a 
is increased, we would expect this biasing effect to be reduced, until a point where too few 
SNPs are selected, when there is then a risk that some of the true signal may be lost. 

Note that the reduced but still significant correlations between empirical and null selection 
frequency distributions at o; = 0.95 in Table 5 are not unexpected. These may reflect the 
complex overlap structure between pathways, meaning that pathways (and associated SNPs) 
with a relatively high degree of overlap with other pathways, due for example to the presence 
of so called 'hub genes', are more likely to harbour true signals, as well as spurious ones [48, 
11, 42]. 

Taking all the above into consideration, we choose to report results with a = 0.95, where 
there is less evidence of bias due to the selection of too many SNPs. The top 30 pathways, 
ranked by selection frequency are presented in Table 6, and the top 30 ranked SNPs, together 
with corresponding genes to which they are mapped are presented in Table 7. Versions of 
these tables extending to lower ranks are provided as supplementary information. 
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Table 7: SP2 dataset: Top 30 SNPs and genes, respectively ranked by SNP and gene selection fre- 
quency. Genes falling in the top 30 ranks of the consensus gene set, ^244^5 obtained by comparing 
gene ranking results from both SP2 and SiMES datasets (see Table 13), are marked with a *. 
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IFNARl 
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94 


8 


rs2834204 


0.31 


IFNARl 


PIK3R1 
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9 


rs3181224 
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IL12B 
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10 
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PIAS2 
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SiMES Analysis 



For the replication SiMES dataset, we repeat the above analysis design, but consider only 
the 'low bias' scenario where A = 0.95Xmax and a = 0.95. Once again we test each scenario 
over 1000 N/2 subsamples, and compare the resulting pathway and SNP selection frequency 
distributions with null distributions generated over 1000 N/2 subsamples with phenotype 
labels permuted. Pathway and SNP selection frequency distributions are presented in Figure 
14. An investigation of pathway and SNP selection bias is presented in the form of scatter plots 
illustrating potential correlation between empirical and null selection frequencies in Figure 13, 
with corresponding Pearson correlation coefficients and p- values presented in Table 8. The 
top 30 ranked pathways, and SNPs and genes are presented in Tables 9 and 10 respectively. 





Figure 13: SiMES dataset: Scatter plots comparing empirical and null pathway (left) and SNP (right) 
selection frequencies presented in Figure 14. For clarity, SNP selection frequencies are plotted for the 
top 1000 SNPs (by empirical selection frequency) only. 



Table 8: SiMES dataset: Pearson correlation coefficients (r)and p-values for the data plotted in Figure 
13. n denotes the number of predictors considered. For SNPs, coefficients describe correlations for all 
predictors selected with nonzero empirical selection frequencies only, since a large number of SNPs are 
not selected by the model at any subsample. 



n r p-value 

pathways 185 -0.094 0.20 
SNPs 20,006 0.058 2.63 x 10" 
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0.5 




pathway (ranked by empirical selection frequency) 



0.5 




SNP (ranked by empirical selection frequency) 

Figure 14: Empirical and null pathway (top) and SNP {bottom) selection frequency distributions 
for the SiMES dataset. a = 0.95. For both empirical (red) and null (blue) distributions, variables 
(pathways and SNPs) are ranked along the x-axis in order of their empirical selection frequencies. 
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Table 10: SiMES dataset: Top 30 SNPs and genes, respectively ranked by SNP and gene selection 
frequency. Genes falling in the top 30 ranks of the consensus gene set, ^244 obtained by comparing 
gene ranking results from both SP2 and SiMES datasets (see Table 13), are marked with a *. 





SNP RANKING 


GENE RANKING 




QIVP 


^SNP 


Mapped gene(s) 


Gene 


—gene 
71 


^ mappea oiNr s 


1 


rs2636698 


0.31 


PPA2 


PPA2 


0.31 


16 


2 


rs2726503 


0.31 


PPA2 


PDSS2 


0.26 


59 


3 


rs2713829 


0.31 
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GABARAPLl 
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4 
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0.31 
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5 
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13 


rs6568474 


0.26 


PDSS2 


NDUFA4 


0.10 


7 


14 


rs9386622 


0.26 


PDSS2 


DGKH 


0.10 


70 
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rs6938393 


0.15 


PDSS2 


UGSTS"" 


0.08 


40 


29 


rsl3202332 


0.13 


PDSS2 


ALDH2 


0.08 


12 


30 


rs9480754 


0.13 


PDSS2 


SDHB 


0.08 


13 



32 



Comparison of ranked pathway lists 

We now consider the problem of comparing rankings (pathways, genes and SNPs) obtained for 
each dataset. To do this we require some measure of distance between each pair of hsts. Ideahy 
this measure should place more emphasis on differences between highly-ranked variables, since 
we expect the association signal, and hence agreement between the ranked lists, to be strongest 
there. By the same reasoning, we expect there to be little or no agreement between variables 
at lower rankings, where selection frequencies are low. Indeed a consideration of empirical 
and null selection frequency distributions (Figures 10 and 14, top) suggests that only the very 
top ranked variables are likely to reflect any true signal, so that we would additionally like 
our distance metric to be able to accommodate consideration of the top-A: variables only, with 
k < where p is the total number of variables ranked in either dataset. One complication 
with top-A; lists is that they are partial, in the sense that unlike complete {k = p) lists, a 
variable may occur in one list, but not the other. 

In order to consider this problem, we introduce the following notation. We denote the 
complete set of ranked predictors by £ = {1, . . . and begin by assuming that all variables 
are ranked in both datasets. We denote the rank of each variable in list 1 by t(z), z = 1, . . . 
so that r{5) — 1 if variable 5 is ranked first and so on. The corresponding ranks for list 2 are 
denoted by cr(z),z = 1, . . . A suitable metric describing the distance between two top-A; 
rankings is the Canberra distance [44], 



^ ^ ^ mm{T(z), A: + 1} + mm{a(z), A: + 1} ^ ^ 



This has the properties that we require, in that the denominator ensures more emphasis is 
placed on differences in the ranks of highly ranked variables in either dataset. Furthermore, 
this distance measure allows comparisons between partial, top-A; lists, since a variable occur- 
ring in one top- A; list but not the other is assigned a ranking of A: + 1 in the list from which 
it is missing. Note also that a variable i that is not in either of the top-A: ranks, that is 
r{i),a(i) > A:, makes no contribution to Ca{k,r,a). 

In order to gauge the extent to which the distance measure (6) differs from that expected 
between two random lists, we require a value for the expected Canberra distance between two 
random lists, which we denote E[Ca{k,p)]. Jurman et al. [44] derive an expression for this 
quantity, and we use this to compute the normalised Canberra distance. 

Note that this has a lower bound of 0, corresponding to exact agreement between the lists. 
For two random lists, the upper bound will generally be close to 1, although it can exceed 
1, particularly for small A;, since the expected value for random lists is not necessarily the 
highest value. 

We illustrate the variation of the normalised Canberra distance (7) between SP2 and 
SiMES pathway rankings in the left hand plot in Figure 15 (blue curve). We consider all 
possible top-A: lists, k — 1, . . . , 185 since all 185 pathways are ranked in both datasets. In the 
same plot, we also show 
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obtained by comparing empirical SP2 rankings (r) against Z — 10, 000 permutations of the 
SiMES pathway rankings, cr^,7r = 1,..., 10, 000 (green curve). This latter curve confirms 
that the expected value, E\Ca{k^'p)\ is indeed a good measure of Ca in the random case 
where there is no agreement between rankings. 
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Figure 15: Comparison of top- A: SP2 and SiMES pathway rankings. Left hand plot: Variation of 
normalised Canberra distance, Ca* with k (7) (blue curve). Corresponding mean values over Z = 
10,000 permutations of SiMES rankings (8) (green curve). Right hand plot: FDR values (blue 
curve). Dotted green line shows the threshold for FDR control at the 5% level. 



Using the same permuted rankings, cr^, we next test the null hypothesis that the observed 
normalised Canberra distance, Ca*(A:, r, a), is not significantly different from that between r 
and a random list a^, by computing a p- value as 

1 ^ 

7T=1 

for A: = 1, ... , 185. We then obtain FDR q- values using the Benjamini-Hochberg procedure 
[5] and illustrate these for each k in the right hand plot of Figure 15. FDR is controlled at 
a nominal 5% level for 19 < A: < 71, indicating that the distance between the top- A; pathway 
rankings for both datasets is significantly different from the random ranking case for a wide 
range of possible values of k. The distance Ca* between SP2 and SiMES pathway rankings 
however attains its minimum value when A: = 25 with q(25) = 0.037, so that on this measure, 
the two pathway rankings are in closest agreement when we consider the top 25 pathways in 
each ranked list only. Some intuitive understanding of why this might be so can be gained by 
considering the empirical vs. null pathway selection frequency distributions for each dataset 
in Figures 10 (b) and 14 (top). Here we see that the separation between empirical and null 
selection frequencies is most clear for values of k below around 30 for SP2, and around 15 for 
SiMES. 

If we assume that the two pathway rankings are indeed in closest agreement when A: = 25, 
then one means of obtaining a consensus set of important pathways is to consider their 
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intersection, 

= ■■ r-'ii) < 25} n {j : a-\j) < 25}, 
from which we can obtain a set of average rankings as 



Both the intersection set, ^J^^ ^ and ordered average rankings, ^fs^^ for the two datasets 
under consideration are shown in Table 11. We additionahy mark the consensus set ^^^^^/^ 
with asterisks in Tables 6 and 9. 

Table 11: Consensus set of pathways, ^25*^5 ^-^^ SiMES datasets with k = 25. Consensus 
pathways are ordered by their average rankings in V^fs^^- 



Pathway Average rank ('^fs^^ 

Dilated Cardiomyopathy 4.5 

Hypertrophic Cardiomyopathy 7.5 

T Cell Receptor Signaling Pathway 11.0 

Terpenoid Backbone Biosynthesis 11.0 
Arrhythmogenic Right Ventricular Cardiomyopathy 12.0 

Ribosome 13.0 

Ppar Signaling Pathway 18.5 



Comparison of ranked gene and SNP lists 

A number of factors complicate the comparison of ranked gene and SNP lists across both 
datasets. Firstly, sets of mapped SNPs and genes differ slightly between the two datasets 
(see Table 2). Secondly, even if we consider only those variables mapped in both datasets, 
different, though overlapping sets of variables are ranked in each. Thirdly, ranked variables are 
not independent [44]. For example, genes may be grouped into pathways, so that a reordering 
of genes within a pathway might be considered less significant than a reordering of genes 
mapping to different pathways. Similarly a reordering of SNPs mapping to a single gene 
might be considered less significant than a reordering of SNPs mapping to different genes. 

In order to compute a distance measure between pairs or ranked lists, we therefore make 
two simplifying assumptions. First, we consider only variables ranked in one or both datasets. 
This seems reasonable, since we can necessarily only compile a distance measure from variables 
that are ranked in one or both datasets. Second, we assume that variables are independent. 
This makes our distance measure conservative, in the sense that it will treat all reordering of 
SNPs or genes equally, irrespective of any potential functional relationship between them. 

With these assumptions in mind, we begin by denoting the set of all variables (genes 
or SNPs) that are ranked in either dataset by C = {l,...,p*}. We further denote the 
corresponding sets of ranked variables for SP2 and SiMES datasets by Cr and respectively. 
We then have the following set relations: Cr^C^ d C] Cr ^ C^] and \Cr\ ^ \C(t\- 

We now extend the previous Canberra distance measure to encompass the above set 
relations. We begin, as before, by defining two ranked lists corresponding to the rankings 
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of all the variables in C for each dataset, although this time we must account for the fact 
that not all variables in C are ranked in both. We denote SP2 rankings by T(i), z = 1, . . . 
where t(z) is the rank of variable i if i ^ Cr, and t(z) = p* otherwise. SiMES rankings are 
defined in the same way, and denoted by cr(z), z = 1, . . . 

Applying this revised ranking scheme, we can then define a top-fc normalised Canberra 
distance (6) as 

for any k < min{|>Cr|, |>Ccr|}. The restriction on k follows from the fact that we cannot distin- 
guish between top-A: rankings for all k > min{\Cr\^ l^cr|}- 



Gene rankings 

Information summarising the relationship between the two ranked lists of genes is given in 
Table 12. 

Table 12: Summary of genes analysed and ranked in SP2 and SiMES datasets. 





SP2 SiMES 


number of genes mapped to pathways 

number of genes mapping to both datasets 

number of ranked genes ( ^t-M'^o- ) 

number of genes ranked in either dataset (p*) 

number of genes ranked in both datasets ( £t H -C^- ) 


4,734 4,751 

4,726 
3,430 2,815 

3,913 

2,332 



We consider normalised Canberra distances, Ca*(/c, r, a), for A: = 1, ... , 500 only, and plot 
these in Figure 16 (left, blue curve), along with Ca* (A:, r, a) (8) for Z = 10, 000 permutations 
of the SiMES pathway rankings, a^, tt = 1, . . . , 10, 000 (green curve). Once again this latter 
curve confirms that the expected value, £'[Ca(A:,p*)], is indeed a good measure of Ca in the 
random case where there is no agreement between rankings. We also plot FDR values using 
the same procedure as described previously for pathways. FDR is controlled at a nominal 5% 
level for all A: > 13 in the region tested (1 < A: < 500). The distance Ca* between SP2 and 
SiMES pathway rankings attains its minimum value when k = 244, so that on this measure, 
the two gene rankings are in closest agreement when we consider the top 244 pathways in 
each ranked list only. 

Following the same strategy as implemented for pathways, we then form the consensus 
set, ^^244^' average rankings '0244^- The consensus set contains 84 genes, and we list the 
top 30 genes ordered by their average rank in the two datasets, in Table 13. 
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Table 13: Top 30 consensus genes ordered by their average rank, ipi 



xianK 


Gene 


Average rank 
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ZO.O 
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ZO.O 


10 


IGr IH 


30.5 


11 


PAK7 


36.5 


12 


ADCY8 


37.5 


13 


VAV2 


41.0 


14 


SLC8A1 


41.5 


15 


CACNB2 


42.5 


16 


CACNA2D1 


43.0 


17 


ITGA9 


44.0 


18 


KRAS 


47.5 


19 


MAPKIO 


50.5 


20 


CACNAIS 


51.0 


21 


VAV3 


54.0 


22 


PLCG2 


55.5 


23 


BCL2 


57.0 


24 


CD80 


60.0 


25 


ITGAll 


60.5 


26 


CTNNA2 


61.0 


27 


ALDHIBI 


61.5 


28 


MGST3 


63.0 


29 


NEDD4L 


63.0 


30 


PRKAG2 


66.0 
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Figure 16: Comparison of top-A: SP2 and SiMES gene rankings, for /c = 1, . . . ,500. Left hand plot: 
Variation of normalised Canberra distance, Ca* with k (9) (blue curve), and corresponding mean 
values over 10,000 permutations of SiMES rankings (8) (green curve). Right hand plot: FDR g'- values 
(blue curve). Dotted green line shows the threshold for FDR control at the 5% level. 



SNP rankings 

Information summarising the relationship between the two ranked lists of SNPs is given 
in Table 14. In contrast to both pathway and gene rankings, it is apparent that relatively 
few ranked SNPs overlap both datasets - 8,151 out of 41,452 SNPs that are ranked in either 
dataset. This results in values for Ca*(/c) that are close to 1, corresponding to the random 
list case, over a wide range of possible values for k (data not shown). 
For this reason, we compute a simple summary measure 

i^'''''^C-^^^:jeCrnC.} (10) 

and report only the top ranking SNPs using this measure in Table 15. 

Table 14: Summary of SNPs analysed and ranked in SP2 and SiMES datasets. 





SP2 SiMES 


number of SNPs mapped to pathways 

number of SNPs mapping to both datasets 

number of ranked SNPs ( ^r U'^o- ) 

number of SNPs ranked in either dataset (p*) 

number of SNPs ranked in both datasets ( H ) 


75,389 78,933 

74,864 
30,027 20,006 

41,452 

8,581 
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Table 15: Top 30 SNPs ranked in both SP2 and SiMES datasets, ranked in order of mean ranking, 



rank 


SNP 


^SNP 


maj 


1 


rs897799 


133.0 


COX6B2 


2 


rs2126953 


203.0 


ITGAl 


3 


rs7714110 


213.0 


ADCY2 


4 


rs6924886 


274.5 


PDSS2 


5 


rs2447867 


275.5 


ITGAl 


6 


rs9386622 


283.0 


PDSS2 


7 


rs6568474 


283.5 


PDSS2 


8 


rsl0446497 


349.5 


PAK2 


9 


rs6583177 


385.5 


PAK2 


10 


rs4765961 


429.0 


CACNAIC 


11 


rs759440 


457.0 


PDSS2 


12 


rsl0457161 


465.0 


PDSS2 


13 


rsl0462842 


479.5 


ADCY2 


14 


rs9373932 


529.0 


PDSS2 


15 


rsl2206487 


532.0 


LAMA2 


16 


rs743567 


543.0 


MYH7 


17 


rsl2472674 


543.0 


CTNNA2 


18 


rsll759792 


557.5 


PDSS2 


19 


rs9373924 


566.0 


PDSS2 


20 


rs319070 


623.0 


PDSS2 


21 


rs751877 


630.0 


ADCY4 


22 


rs2047698 


714.0 


PDGFD 


23 


rs4804505 


727.5 


PDE4A 


24 


rsl2672417 


764.0 


SMURFl 


25 


rs7766689 


800.5 


LAMA2 


26 


rsl57694 


860.0 


MAP3K7 


27 


rs554192 


878.0 


NEDD4L 


28 


rs2746543 


896.5 


SDHB 


29 


rs742257 


942.0 


LAMB3 


30 


rsl798619 


944.5 


PAK2 



mapped gene(s) 



ILll 



LTB4R RIPK3 



KEAPl 
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Discussion 



We have outlined a method for the detection of pathways, SNPs and genes associated with a 
quantitative trait. Our method uses a sparse regression model, the sparse group lasso, that 
enforces sparsity at the pathway and SNP level. As well as identifying important pathways, 
this approach is designed to maximise the power to detect causal SNPs, possibly of low effect 
size, that might otherwise be missed if pathways information is ignored. In a simulation study 
we demonstrated that where causal SNPs are enriched within a single causal pathway, SGL 
does indeed have greater SNP selection power, compared to an alternative sparse regression 
model, the lasso, that disregards pathways information. These results mirror previous findings 
that support the intuition that a sparse selection penalty that promotes dual-level sparsity is 
better able to recover the true model in these circumstances [26, 69]. 

We then argued from a theoretical standpoint that where individual SNPs can map to 
multiple pathways, a modification (SGL-CGD) of the standard SGL-BCGD estimation algo- 
rithm that treats pathways as independent, may offer greater sensitivity for the detection of 
causal SNPs and pathways. A potential concern is that this gain in power may be accompa- 
nied by an inflated number of false positives. However, in a simulation study with overlapping 
pathways we found relative gains in both sensitivity and specificity, under the independence 
assumption. This gain in specificity was unexpected, and appears to arise directly from treat- 
ing pathways as independent in the model estimation. As with the group lasso, the ability 
of SGL to recover the true model is likely to be affected by the complexity of the pathway 
overlap structure [60], although we expect that the gains in power and sensitivity achieved 
with the independence assumption will also be apparent with real data. 

Our method combines the SGL model and SGL-CGD estimation algorithm with a weight- 
tuning algorithm to reduce selection bias, and a resampling technique designed to provide a 
robust measure of SNP, gene and pathway importance in a finite sample. As such, the latter 
is expected to confer advantages, in terms of the down ranking of unimportant predictors, 
previously observed for the lasso [54, 13]. Once again it would be interesting to explore this 
further using simulations derived from real pathway and genotype data. 

We do not explore the issue of determining a selection frequency threshold for the control of 
false positives here. In principal such a threshold could be determined by comparing empirical 
selection frequency distributions with those obtained under the 'null', through permutations, 
although this is not a trivial exercise [83]. An alternative method for error control has been 
investigated in the context of lasso selection [54], but the direct application of this approach 
to the present case is not feasible, since overlapping pathways make clear distinctions between 
causal and noise variables problematic. We instead develop a heuristic measure of ranking 
performance in our application study identifying SNPs and pathways associated with serum 
high-density lipoprotein cholesterol levels (HDLC). Firstly, by comparing empirical and null 
pathway and SNP rankings for each dataset, we gain some confidence that pathway and SNP 
signals captured in the top rankings can be distinguished from those arising from noise or 
spurious associations. Secondly, we take advantage of the fact that we are able to compare 
results from two independent GWAS datasets. On the assumption that similar patterns of 
genetic variation are likely to impact HDLC levels in both cohorts, we set a ranking threshold 
based on computing distances between ranked lists from each dataset. 

Interestingly, when a comparison between empirical and null rankings is made with a 
reduced value for the regularisation parameter a, there is evidence of selection bias, in the 
sense that pathways and SNPs tend to be highly ranked both empirically and under the 
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null (see Figures 10 and 11). Since a smaller a corresponds to a greater number of SNPs 
being selected at each subsample, this would seem to suggest that too many SNPs are being 
selected. In this case, pathway and SNP rankings may in part reflect spurious associations, 
with a bias towards SNPs overlapping multiple pathways. 

There are other potentially interesting areas to explore with regard to the subsampling 
method used here. For example, standard approaches consider only the set of variables 
selected at each subsample, and ignore potentially relevant information captured in the co- 
efficient estimates themselves. The use of this additional information would result in a set 
of ranked lists, one for each subsample, and the joint consideration of these lists has the 
potential to provide a more robust measure of variable importance, by taking account of the 
relative importance of each variable for each subsample [64, 47, 43]. 

Turning to the study results, we conduct two separate analyses on independent discov- 
ery and replication datasets. Since subjects from both datasets are genotyped on the same 
platform, the large majority of SNPs mapping to pathways in one dataset do so also in the 
other dataset. Thus 99.3% of SNPs mapping to pathways in the SP2 dataset are similarly 
mapped in the SiMES dataset. For the SiMES dataset, the corresponding figure is 94.8%. 
As expected, the concordance of gene coverage is even greater. Thus 99.8% of mapped genes 
in the SP2 dataset are also mapped in the SiMES dataset, and 99.5% of mapped genes in 
the SiMES dataset are also mapped in SP2. This large overlap in gene (and pathway) cov- 
erage between datasets is likely to occur even when datasets are genotyped on different SNP 
arrays. Indeed this is one advantage of methods such as the one described here that enable 
comparisons between pathway and gene rankings. 

We obtain consensus pathway and gene rankings by considering only the top k ranks 
in each dataset, with k obtained as the value that minimises the distance between the two 
rankings. We additionally derive a significance measure for each top-/c distance by comparing 
empirical distances against a null distribution obtained by permuting ranks in one list. We 
note that this can only be an approximation of the true null, since in reality rankings for 
both datasets may be influenced by the extent to which genes and SNPs overlap multiple 
pathways. However, some support for the reasonableness of this approximation can be gained 
from our earlier analysis, showing that the correlation between empirical and null pathway 
and SNP rankings is low, so that rankings under the null are indeed approximately random. 

Considering the consensus pathway rankings in Table 11, three out of the seven consensus 
pathways (ranked 1, 2 and 5), are related to cardiomyopathy. These three pathways are the 
only cardiomyopathy-related pathways amongst the 185 KEGG pathways used in our analysis, 
so it is noteworthy that all three fall within the consensus pathway rankings. The link between 
HDLC levels and cardiomyopathy is already well established [1, 29, 81, 24, 27]. Furthermore, 
numerous references in the literature also describe the links between lipid metabolism and T 
cell receptor (consensus pathway ranking 3) and PPAR signahng (rank 7) [40, 9, 73, 6]. 

Turning to a consideration of the top 30 consensus genes and SNPs presented in Tables 
13 and 15 (and see also pathway ranking tables 6, 9 and 11, and extended results in supple- 
mentary information). We found that many are enriched in one of several gene families: 

1. L-type calcium channel genes, including C^CTV^iC, CACNAIS, CACNA2D1, CACNA2D3 
and CACNB2 

2. Adenylate cyclase genes, including ADCY2, ADCY4 and ADCY8 

3. Integrin and laminin genes, including ITGAl, ITGA9, ITGAll, LAMA2, and LAMAS 
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4. MAPK signaling pathway genes, including M APR 10 and MAP3K7 



5. Immunological pathway genes, including PAK2, PAK7, PRKCA, PRKCB, VAV2 and 
VAV3 

These genes are highly enriched in several high ranking pathways from both datasets. 
Notably, the focal adhesion pathway alone has 12 gene hits, as does the dilated cardiomyopa- 
thy pathway. Cardiomyopathy pathways as a whole have 30 genes hits (several of the genes 
overlap more than one cardiomyopathy pathway). 10 of these genes feature in the MAPK 
signahng pathway, while GnRH (8 genes), T and B cell receptor (8), calcium (7), ErbB (5), 
and Wnt signling (4) pathways also contain several genes in the list. To elucidate the bio- 
logical relevance of these gene families and the connections between them, we investigated 
their known functional links with cardiovascular phenotypes (not restricted to HDLC) by ref- 
erencing the KEGG and Genetic Association (http:/ /geneticassociationdb. nih.gov) databases. 

Voltage dependent L-type calcium channel gene family 

The genes in this family encode the subunits of the human voltage dependent L-type calcium 
channel (CaVl). The a-1 subunit (encoded by CACNAIC, AIS, A2D1 and A2D3 in our 
study) determines channel function in various tissues. CaVl function has significant impact 
on the activity of heart cells and smooth muscles. For example, patients with malfunctioning 
CaVl develop arrhythmias and shortened QT interval [72, 2, 75]. Furthermore, CACNAIC 
polymorphisms have been associated with variation in blood pressure in Caucasian and East 
Asian populations by pharmacogenetic analysis. In 120 Caucasians, 3 SNPs in this gene were 
significantly associated with the response to a widely applied antihypertensive CaVl blocker 
[8]. Kamide et al. [45] also found that polymorphisms in CACNAIC were associated with 
sensitivity to an antihypertensive in 161 Japanese patients. The CaVl /3 subunit encoding 
CACNB2 has also been associated with blood pressure [49]. 

This gene family was mapped to several pathways in our study, with the KEGG dilated 
cardiomyopathy pathway achieving highest rank both within individual datasets, and in the 
consensus pathway rankings. Dilated cardiomyopathy is the most common form of cardiomy- 
opathy, and features enlarged and weakened heart muscles. Although high levels of serum 
HDLC lowers the risk of heart disease [12, 81], there is still no direct evidence that CaVl is 
involved in HDLC metabolism. 

Adenylate cyclase gene family 

Three adenylate cyclases genes, ADCY2, ADCY4 and ADCY8 were highly ranked in our 
study. Currently, there are no reported associations of these genes with cardiovascular disease 
or lipid levels. Adenylate cylcase genes catalyse the formation of cyclic adenosine monophos- 
phate (cAMP) from adenosine triphosphate (ATP), while cAMP servers as the second mes- 
senger in cell signal transduction. Note that ADCY2 is insensitive to calcium concentration, 
suggesting that any association of this gene family with HDLC levels may not be due to any 
interactions with the CaVl gene family. 

Among high ranking pathways, ADCY2 and ADCY8 feature in the dilated cardiomyopa- 
thy pathway. Although ADCY4 was not in the top 30 consensus genes, rs751877 in this gene 
was among the top 30 consensus SNPs. 
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Integrin and laminin gene families 
We found 3 genes encoding integrin subunits in our study. Integrins hook to the extracellular 
matrix (ECM) from the cell surface, and are also important signal transduction receptors 
which communicate aspects of the cell's physical and chemical environment [58]. Interestingly, 
laminins are the major component of the ECM, and are relevant to the shape and migration of 
almost every type of tissue. Both of these two families of genes are therefore highly relevant to 
the survival and shape of heart muscles. A recent GWAS conducted in a Japanese population 
confirmed a previous association between ITGA9 and blood pressure in European populations 
[74]. 

Integrin family genes and LAMA2 were selected primarily within high-ranking cardiomy- 
opathy, focal adhesion and ECM receptor signaling pathways, with once again the dilated 
cardiomyopathy pathway achieving the highest ranks. However, evidence for LAMAS asso- 
ciation is weaker, since it was not in the top 30 consensus genes, although a SNP from the 
LAMB 3 (laminin /3-3) gene was ranked 29 in the consensus SNP list. 

MAPK signaling pathway 

TAKl {MAP3K7) and JNK3 {MAPKIO) are kinases which regulate ceh cycling. They acti- 
vate or depress downstream transcription factors which mediate cell proliferation, differenti- 
ation and inflammation. 

JNK activity has been associated with obesity in a mouse model, where the absence of 
JNKl {MAPK8), a protein in the same family as MAPKIO, protects against the obesity- 
induced insulin resistance [33] . The negative correlation between HDLC level and obesity has 
been well accepted [36] . 

Immunological pathways 

PAK {PAK2 and PAK7) genes feature in the high ranking T cell signaling pathway in both 
SP2 and SiMES datasets. PRKC (including PRKCA and PRKCB), along with VAV {VAV2 
and VAV 3) genes also feature in various high ranking immunological pathways including T 
cell signaling. Pathogenic Escherichia Coli Infection and Natural Killer Cell Mediated Cyto- 
toxicity. Genes from all 3 of these families are frequently top ranked in these pathways. 

PAK and VAV are activated by antigens, and regulate the T cell cytoskeleton, indicating 
a possible impact on T cell shape and mobility. In a candidate gene association analysis, 
PRKCA was reported to be associated with HDLC at a nominally significant level, but was 
not significant after adjusting for multiple testing [51]. 

In summary, genes enriched in the above gene clusters and pathways may be relevant to 
heart muscle cell signal transduction, shape and migration, and may thus have functional 
relevance to the onset of cardiovascular diseases. Many highly ranked genes in our study 
are also involved in neurological pathways. For example polymorphisms in CACNAIC have 
been associated with bipolar disorder, schizophrenia and major depression [22, 55, 30]. This 
points to an interesting hypothesis that serum HDLC levels might be regulated not only 
by metabolism but also by neurological pathways, although the elucidation of any putative 
biological mechanism underlying such an association obviously exceeds the scope of this study. 

Besides the gene families and associated pathways discussed above, a notable feature of the 
top 30 consensus SNP ranking results presented in Table 15 is the inclusion of 9 SNPs mapping 
to the PDSS2 gene. PDSS2 achieves its high ranking in the consensus SNP list, through its 
inclusion in the highly ranked terpenoid backbone biosynthesis pathway in SiMES, and this 
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gene is in fact the second highest consensus ranking gene with this dataset. In contrast, this 
gene is ranked low (216) in the SP2 dataset, which explains why it fails to make the top 30 
consensus gene rankings. PDSS2 encodes subunit 2 of prenyl diphosphate synthase, which 
determines the length of the isoprenoid chain of coenzyme QIO (CoQlO) [61]. A deficiency in 
biosynthesis of CoQlO has previously been associated with delayed motor development and 
abnormal renal function, with excess serum lipids [70]. 

Despite the well established links between lipid metabolism and PPAR signaling noted 
above, no genes in this high-ranked pathway fall in the top 30 gene rankings for either dataset 
(see Tables 6 and 15). This could be because the association signal in this pathway is more 
widely distributed, compared to other high ranking pathways, perhaps indicating hetero- 
geneity in genetic causal factors within our sample, so that different genes and SNPs are 
highlighted in different subsamples. This would result in reduced gene selection frequencies. 
Also, genes that overlap multiple putative causal pathways are more likely to be selected in 
a given subsample, meaning that associated genes mapping to pathways with relatively few 
overlaps may have lower selection frequencies. This may be the case with genes in the PPAR 
signaling pathway, whose 63 genes map to an average 2.7 ±1.8 pathways. As a comparison, 
the 84 genes in the top-ranked dilated cardiomyopathy pathway map to an average 7.2 ± 3.8 
pathways. 

Our study failed to highlight HDLC-associated SNPs identified in previous GWAS (see 
for example www.genome.gov/gwastudies for an up to date list). A primary reason for this 
is that the large majority of SNPs identified in previous studies do not map to pathways in 
our study, either because they fall in intergenic regions, or because they do not feature on 
the Illumina arrays used here. In addition our method is designed to highlight distributed, 
small SNP effects that accumulate across gene pathways, and so will likely fail to identify 
those SNPs with significant marginal effects targeted by GWAS. Furthermore, where there 
are common mechanisms affecting phenotypes in both cohorts, we would expect to observe 
the most concordance between the two studies at the pathway level, followed by genes, and 
lastly SNPs. Indeed this increased heterogeneity at the SNP, and to a lesser extent at the 
gene level is one motivation for adopting a pathways approach in the first place [35, 34, 
10]. This reduced concordance at the SNP level may be due to increased heterogeneity of 
genetic risk factors between the two datasets. Another important source of variation in SNP 
selection frequencies is LD between SNPs. The within-pathway lasso penalty will tend to 
select one of a group of highly correlated SNPs at random, reducing SNP selection frequencies 
within LD blocks harbouring causal SNPs. An alternative approach would be to consider a 
different penalty within selected pathways, for example the elastic net [92], which selects 
groups of correlated variables jointly, although this comes at the cost of introducing a further 
regularisation parameter to be tuned. 

Finally, as with all pathways analyses, a number of limitations with this general approach 
should be noted. Despite great efforts, pathway assembly is still in its infancy, and the 
relative sparsity of gene-pathway annotations reflects the fact that our understanding of how 
the majority of genes functionally interact is at an early stage. As a consequence, annotations 
from different pathways databases often vary [71], so that the choice of pathways database will 
impact results [19, 10]. Results are also subject to bias resulting from SNP to gene mapping 
strategies, so that for example SNP to gene mapping distances will affect the number of 
unmapped SNPs falling within gene 'deserts' [20]; SNPs may map to relatively large numbers 
of genes in gene rich areas of the genome; and the mapping of a SNP to its closest gene 
may obscure a true functional relationships with a more distant gene [87]. Indeed recent 
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research from the ENCODE project indicates that functional elements may in fact be densely 
distributed throughout the genome [7, 62], and this information has the potential to radically 
alter future pathways analysis. These issues, together with the fact that PGAS methods 
are by construction designed to highlight distributed, moderate to small SNP effects, serve 
to further illustrate the point that pathways analysis should be seen as complementary to 
studies searching for single markers [86]. 
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Supplementary Information 
A SGL estimation algorithm 

For the estimation of (3^^^ we proceed by noting that the optimisation (1) is convex, and 
(in the case of non-overlapping groups) that the penalty is block-separable, so that we can 
obtain a solution using block, or group- wise coordinate descent [82]. For a single group, /, the 
corresponding minimising function is given by 

/(A) = ^l|y - X/3||i + (1 - a)\wim\2 + aAllAlli- (11) 
An optimal solution for SNP coefficient /3j is then derived from the subgradient equations 

~ ^A^i - X^x/e^/c - Xj/3j) + (1 - oi)\wiSj + aXtj = j = li,...Jp^, (12) 

where /3k^k ^ j are the current estimates for other SNP coefficients in group /, and the group 
partial residual, f ^ = y — J^m^l^^f^^- Here sj and tj are the respective subgradients of 
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2 and with 



' if||All2 = o 

fsign(/3,) 
' if/3, = 0. 

If = 0, that is group / is not selected by the model, then from (12) 

- x^-r/ + (1 - a)XwiSj + aXtj = 0, j = /i, . . . .Ip^. (14) 

Substituting a = X^f ^ gives 

ttj = (1 — a)XwiSj + aAtj, j = /i, . . . , /p^ 

so that 



5 



2_ 1 r^. ^,\-,\2 



(1 - ayx^wf 

and 



. ^ (1 - a)2A2it;2 -^^ 
From (13), when /3/ = 0, ||5||2 = (Z^j-^j)^ ^ that 

^(a^- - aAt^-)^ < (1 - Q^)^A2it;f . (15) 

Also from (13), one further condition when /3/ = is that ij G [—1,1]. The values, ij that 
minimise the left hand size of (15) are therefore given by 



a 



t 



Substituting for aj, we can then write the values for aj — aXtj that minimise the left hand 
side of (15) as 



aj — aXtj — 



if Ix'-r/l < aX 

sign(x^.r^; 

= S'(x^-r/,aA) 



sign(x^.r^)(|x^.r/| - aX) if |x^.r/| > aX 



for j = /i, . . . , where 

S'(x^-r/,aA) = sign(x^-r/)(|x^-r/| - aA)+ (16) 

is the lasso soft thresholding operator. Finally, we can now rewrite the condition for f3i — 0^ 
(15) as 

\\S{y.[vuaX)\\2<{l-a)Xwu (17) 
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Where the vector ^(X^r/, aX) = [^(x^^r^, aA), . . . , ^(x^^r^, aX)]. Note that with a = 0, this 
reduces to the group lasso group selection criterion. 

In the case that /3/ ^ 0, that is group / is selected by the model, from (12) and (13) we 
see that /3j = when 

-x;.(f;-5]xfc4) < |aA|. (18) 

For completeness, we rewrite the criterion for selecting pathway I from (17) as 

||5(Xjr;,aA)||2 > (l-a)A«;, (19) 
and the criterion for selecting SNP j in selected pathway / from (18) as 

|x^-f/j| > aX (20) 

where vij = ^l~J2kj^j ^kPk is the SNP partial residual, obtained by regressing out the current 
estimated effects of all other predictors in the model, except for predictor j. 

A number of methods for the estimation of /3/ in the case that HAlb 7^ have been 
proposed [26, 23, 50, 69]. A complicating factor is the discontinuities in the first (and second) 
derivatives of Sj at ||A||2 = 0, that is where HAlb first moves away from zero, and of tj when 
/3j = 0. As with GL, Friedman, Hastie, and Tibshirani [26] describe a numerical method 
using coordinate descent, by combining a golden search over /3j with parabolic interpolation. 
However we find this too computationally intensive for the large datasets we wish to analyse. 
Simon et al. [69] propose an accelerated, block gradient descent method in which f3i is itera- 
tively updated in a single step along the line of steepest descent of the block objective function 
until convergence. We instead use a block, coordinate-wise gradient descent (BCGD) method 
that uses a Newton update, similar to that proposed by Zhou et al. [91], and we describe this 
below. 

To update /3j from its current estimate, we note from (12) and (13) that if ^ 0, the 
subgradient equation for predictor j is given by 

dj = -x;.(r^ - X/A) + (1 - a)Xwi-§-- + aX • sign(/3,). (21) 

IIAII2 

We then descend along the gradient at (3j towards the minimum using Newton's method. 
The Newton update, is then given by 

(9- 

(22) 



where d' ^ 1 + ^-^^^ (l - ^) 



IIAII2 V ||/3,||2. 

is the derivative of (21) at $j. The update (22) is repeated until convergence. 

We must also deal with the case where $j =0. Here we adopt a slightly different strategy, 
since the partial derivative, tj of /3j is not continuous. We avoid this discontinuity by testing 
the 'directional derivatives', dj~ and , respectively representing the partial derivatives at 
Pj = in the direction of increasing and decreasing Recalling that we are dealing with 
the case HAlb 7^ 0, at /3j = the group penalty term in (21) disappears. That is, once a 
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group is selected by model it becomes easier for each SNP coefficient to move away from zero. 
The two directional derivatives are then given by 



(23) 



Since the minimising function (11) is convex, there are three possible outcomes, and we 
substitute for dj in (22) accordingly: 

{dj- if dj- > and 9+ > 
a+ if < and 9+ < (24) 
if dJ- > and < 

In the third case, /(A) is increasing either side of /3j = 0, so that /3j must remain at zero. 
We can then proceed with the standard Newton update (22). 

Finally, since the Newton update may occasionally overstep the minimum (where dj = 0), 
a simple remedy proposed by Zhou et al. [91] is to check that /(A) is decreasing at each 
iteration. If this is not the case, then the step size in (22) is halved. The complete algorithm 
for SGL estimation using BCGD is presented in Box 1. 

One remaining practical issue is the obtaining of a value for Xmax^ the smallest value of 
A at which no groups are selected by the model. Noting that f / = y when no groups are 
selected, from (17) we obtain the smallest value, A™^, for the minimum value of A at which 
group / is not selected as 

^ ||g(X^y,aA— )||2 

We can solve this in its quadratic form by first setting an upper bound for A at the point A^*, 
where the soft thresholding function S(K[yi, aX) = 0, that is when no SNPs are selected by 
the model. We then obtain the solution by solving 

||5(X^y,«Ar")||^ - (1 - a)2(Ar")V = < < A^ (26) 

for Ap^^, where 

|x'-y| 

A;* max , j = /i, . . . 

J <y 

We do this using the Id root-finding function, brentq, in Python's scipy library. Finally, we 
obtain a value for Xmax as 

A^,, = maxAr", / = 1,...,L. (27) 



B SGL with overlaps 

We assume that X and /3 have been expanded to account for overlaps, but we drop the * 
notation for clarity. We proceed as before by solving the block-separable optimisation (4) 
for each group or pathway in turn. However, for overlapping pathways, the assumption 
of pathway independence requires that each X^, (/ = 1, . . . ,L) is regressed against the full 
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phenotype vector y rather than the partial residual, With this in mind, the revised 
subgradient equations for group / (12) are given by 

~ - X^x/e^/e - Xj/3j) + (1 - a)XwiSj + aXtj = j = /i, . . . , /p^. (28) 

The estimation for group / then proceeds as described in the previous section, but with 
the partial residual f/ replaced by y, so that the group sparsity condition (17) for ||/3;||2 = 
becomes 

\\S{X'iy,aX)\\2<{l-a)Xwi. (29) 

As before, where group / is selected by the model, the update for with current estimate 
Pj, is derived from the partial derivative (21), which under the independence assumption is 
given by 

dj = -x;.(y - X^A) + (1 - a)Xwi-§-- + aX • sign(/3,), (30) 

IIAII2 

for j = /i, . . . The Newton update (22) remains the same. When $j = 0, the revised 
directional derivatives (23) are given by 

a; = -x;.(y-X,A) + aA 

As before the conditions for SNP sparsity within a selected group are determined by (24). 

The value of Xmax^ the smallest A value at which no group is selected by the model, is 
determined in the same way as before, since this procedure (described in (25), (26) and (27)) 
does not depend on f/. 

Importantly, since each group is regressed independently against the phenotype vector y, 
there is no block coordinate descent stage in the estimation, that is the revised algorithm 
utilises only coordinate gradient descent within each selected pathway. For this reason we 
use the acronym SGL-CGD for the revised algorithm. The new algorithm is described in Box 
2. Note that since the block coordinate descent stage is avoided, the new algorithm has the 
added benefit of being much faster than would otherwise be the case. 



C Simulation study 1 

A baseline phenotype, y is sampled from A/'(10, 1). To generate SNP effects, we first select a 
single pathway, Gi, random. From this pathway we randomly select 5 SNPs to from the 
set 5 C ^/ of causal SNPs. At each MC simulation we generate a genetic effect and adjust y 
so that 

y"^ = y + w 

where 

kes 

Here 6 controls the overall additive genetic effect on phenotype y due to all casual SNPs in 5, 
and (k determines the contribution from causal SNP k, with J2keS Ck — ^- Iii our simulations 
we maintain a constant overall genetic effect size, 

7 = EH/E(y) 
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across all affected phenotypes, so that 7 represents the proportionate increase in the mean 
value of y due to all genetic effects. We also set = 1/5, for A: G 5, so that the contribution 
from each causal SNP allele is equal. This enables us to determine 5 for a given 7 as 

^ ^ 57E(y) 

Note that for constant 7, the proportionate effect on the mean value of y due to SNP k is 
MAP dependent, and is given by 2Smk/E{y). 

D Weight tuning for bias reduction 

Por fixed a, and with A tuned to select a single pathway, we need to establish which pathway 
enters the model first, as A is reduced from its maximal value, Xmax- Prom (27), at phenotype 
permutation r, the pathway Cr selected with permuted phenotype is given by 

4 = argmaxAp^, ,/ = l,...,L. 

^rrnn jg obtained by solving 

^ \\S{X!^r,a\rn\\2 
' {l-a)wi 

using the procedure described at the end of Section A. For R permutations of the phenotype 
vector, y, the empirical pathway selection frequency distribution is then given by 

1 ^ 

^*(^)-^Y.iCr^l}, 1^1,. ..,L. 

r=l 
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